AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
# Installing the libraries with the specified versions
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note:
After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.
On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the code above ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# to split data into training and test sets
from sklearn.model_selection import train_test_split
# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# to tune different models
from sklearn.model_selection import GridSearchCV
# to compute classification metrics
from sklearn.metrics import (
confusion_matrix,
accuracy_score,
recall_score,
precision_score,
f1_score,
)
# Mounting Google Colab drive
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset
customer_data = pd.read_csv("/content/drive/MyDrive/Python/Loan_Modelling.csv")
# copying the data to another variable to avoid any changes to original data
df = customer_data.copy()
# viewing the first 5 rows of the data
df.head()
# viewing the last 5 rows of the data
df.tail()
#Checking the shape of the dataset
df.shape
# checking datatypes and number of non-null values for each column
df.info()
#Checking statistical Summary of the data frame
df.describe(include="all").T
# checking for missing values
df.isnull().sum()
# checking for duplicate values
df.duplicated().sum()
# checking the number of unique values in each column
df.nunique()
# checking unique values of Family
print(f"Unique values in 'Family': {df['Family'].unique()}")
# checking unique values of ZIPCode
print(f"Unique values in 'ZIPCode': {df['ZIPCode'].unique()}")
# finding the unique values of Experience to check for negative values
experience_values = df['Experience'].unique()
print(experience_values)
# checking how many rows in the Experience column have negative values
negative_values = df[df['Experience'] < 0].shape[0]
print(negative_values)
NaN imputation is avoided to prevent unforeseeable issues in model building. Imputation by the mean is also avoided here, as the mean experience is approximately 20 years and imputing it could significantly distort the Experience data.
# Customer ID has no implication for the EDA, hence this column can be dropped
df.drop('ID', axis=1, inplace=True)
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
if bins:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram with specified bins
else:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with default bins
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 6))
else:
plt.figure(figsize=(n + 2, 6))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
stat="density",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
stat="density",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
#Plot Income (Histogram Boxplot)
histogram_boxplot(df, "Income")
Observations
#Plot Monthly CC Average Spend (Histogram Boxplot)
histogram_boxplot(df, "CCAvg")
Observations
#Plot Mortgage(Histogram Boxplot)
histogram_boxplot(df, "Mortgage")
#Plot Age
histogram_boxplot(df, "Age")
#Plot Experience
histogram_boxplot(df, "Experience")
#Plot ZIPCode
histogram_boxplot(df, "ZIPCode")
# Observations on Family
labeled_barplot(df, "Family", perc=True)
labeled_barplot(df, "Education", perc=True)
labeled_barplot(df, "Personal_Loan", perc=True)
In the US, banks do not correlate loan eligibility criteria with ZIP codes. Hence this field can be dropped from the bivariate analysis, as it has no implication.
df.drop('ZIPCode', axis=1, inplace=True)
# defining the size of the plot
plt.figure(figsize=(30, 20))
# defining the list of numerical features to plot (ZIPCode was dropped above)
num_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# plotting the heatmap for correlation
sns.heatmap(
df[num_features].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# defining the list of numerical features to plot (ZIPCode was dropped above)
num_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']
# pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(data=df, vars=num_features, diag_kind="kde", corner=True, hue='Personal_Loan')
plt.show()
Let's see how the target variable varies with family size.
stacked_barplot(df, "Family", "Personal_Loan")
stacked_barplot(df, "Education", "Personal_Loan")
stacked_barplot(df, "CreditCard", "Personal_Loan")
Around 70% of loan purchasers do not hold a credit card from another bank, showing no obvious relationship between holding an external credit card and purchasing a loan.
stacked_barplot(df, "Online", "Personal_Loan")
stacked_barplot(df, "Securities_Account", "Personal_Loan")
stacked_barplot(df, "CD_Account", "Personal_Loan")
Let's analyze the relation between Income and Personal Loan.
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
Observations
Overall EDA Insights
# ID and ZIPCode columns were already dropped during EDA
df.drop(['ZIPCode'], axis=1, inplace=True, errors='ignore')  # safeguard: no-op if already dropped
# treating negative values in the Experience column by taking absolute values
df['Experience'] = df['Experience'].abs()
# defining the explanatory (independent) and response (dependent) variables
X = df.drop(["Personal_Loan"], axis=1)
y = df["Personal_Loan"]
# specifying the datatype of the independent variables data frame
X = X.astype(float)
X.head()
Creating training and test sets.
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=42
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
We had seen that around 90.4% of observations belong to class 0 (No Personal Loan) and 9.6% belong to class 1 (Took Personal Loan); this distribution is preserved in both the train and test sets.
Model can make wrong predictions as:
1. Predicting that a customer will not take a loan when the customer would actually take one (a false negative).
2. Predicting that a customer will take a loan when the customer would not (a false positive).

Which case is more important?
If we predict that a customer will not take a loan, the bank does not target that customer with the promotion campaign and loses the opportunity to convert a customer who would have purchased the loan.
If we predict that a customer will purchase a loan but in reality the customer does not, the bank only bears the cost of the targeted campaign. This promotion cost is budgeted for and is a far smaller loss than missing a potential customer who could have been converted into revenue.

How to increase revenue?
The bank would want recall to be maximized: the greater the recall score, the fewer the false negatives (missed potential loan purchasers).
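As a small illustration with toy labels (not the project data), recall reads directly off the confusion matrix as TP / (TP + FN), so maximizing recall minimizes missed loan purchasers:

```python
# Toy example: recall = TP / (TP + FN), so higher recall means fewer
# missed buyers (false negatives).
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = bought the loan
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False negatives (missed buyers):", fn)   # 1
print("Recall:", recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```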
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
model0 = DecisionTreeClassifier(criterion="gini",random_state=42)
model0.fit(X_train, y_train)
confusion_matrix_sklearn(model0, X_train, y_train)
decision_tree_default_perf_train = model_performance_classification_sklearn(
model0, X_train, y_train
)
decision_tree_default_perf_train
confusion_matrix_sklearn(model0, X_test, y_test)
decision_tree_default_perf_test = model_performance_classification_sklearn(
model0, X_test, y_test
)
decision_tree_default_perf_test
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model0,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model0, feature_names=feature_names, show_weights=True))
Using the extracted decision rules above, we can make interpretations from the decision tree model. For example, consider the path:

|--- Income <= 98.50
|    |--- CCAvg > 2.95
|    |    |--- CD_Account <= 0.50
|    |    |    |--- Age <= 27.00

If Income is less than or equal to 98.5k, monthly CCAvg spend is greater than 2.95k, CD_Account is less than or equal to 0.5 (i.e., no CD account), and Age is less than or equal to 27, the customer is most likely to purchase a personal loan.
Note: Interpretations from other decision rules can be made similarly.
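As a sketch (using a hypothetical toy frame, not the project data, with column names from the data dictionary), a rule like the one above can be expressed as a pandas boolean filter to count how many customers satisfy it:

```python
# Hypothetical toy frame; the filter mirrors the decision path
# Income <= 98.5 AND CCAvg > 2.95 AND CD_Account <= 0.5 AND Age <= 27.
import pandas as pd

toy = pd.DataFrame({
    "Income": [80, 120, 95], "CCAvg": [3.5, 1.0, 3.0],
    "CD_Account": [0, 1, 0], "Age": [25, 40, 26],
})
rule = (
    (toy["Income"] <= 98.5) & (toy["CCAvg"] > 2.95)
    & (toy["CD_Account"] <= 0.5) & (toy["Age"] <= 27.0)
)
print(toy[rule].shape[0])  # number of customers matching the rule → 2
```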
importances = model0.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes
In this case, we will set class_weight = "balanced", which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data
class_weight is a hyperparameter for the decision tree classifier
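As an illustration of what "balanced" computes under the hood (toy labels mirroring a roughly 90/10 split, not the project data), scikit-learn derives each class weight as n_samples / (n_classes * count_c), i.e. inversely proportional to class frequency:

```python
# Sketch: how class_weight="balanced" derives its weights on toy labels.
# Minority-class errors then cost proportionally more during training.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # 90% class 0, 10% class 1
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # class 1 gets 9x the weight of class 0
```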
model1 = DecisionTreeClassifier(criterion="gini",random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
confusion_matrix_sklearn(model1, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model1, X_train, y_train
)
decision_tree_perf_train
confusion_matrix_sklearn(model1, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(
model1, X_test, y_test
)
decision_tree_perf_test
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model1,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=True,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model1, feature_names=feature_names, show_weights=True))
importances = model1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
for max_leaf_nodes in max_leaf_nodes_values:
for min_samples_split in min_samples_split_values:
# Initialize the tree with the current set of parameters
estimator = DecisionTreeClassifier(
max_depth=max_depth,
max_leaf_nodes=max_leaf_nodes,
min_samples_split=min_samples_split,
class_weight='balanced',
criterion='gini',
random_state=42
)
# Fit the model to the training data
estimator.fit(X_train, y_train)
# Make predictions on the training and test sets
y_train_pred = estimator.predict(X_train)
y_test_pred = estimator.predict(X_test)
# Calculate recall scores for training and test sets
train_recall_score = recall_score(y_train, y_train_pred)
test_recall_score = recall_score(y_test, y_test_pred)
# Calculate the absolute difference between training and test recall scores
score_diff = abs(train_recall_score - test_recall_score)
# Update the best estimator when the current model has a smaller train-test score difference and a higher test recall
if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
best_score_diff = score_diff
best_test_score = test_recall_score
best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
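As an alternative sketch, the same grid could be searched with GridSearchCV (imported earlier but unused here), which scores each combination by cross-validated recall on the training data instead of comparing train and test scores directly; the fit call is left commented because it assumes the project's X_train and y_train:

```python
# Sketch: cross-validated grid search over the same hyperparameter grid,
# optimized for recall. Avoids repeated use of the test set during selection.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": np.arange(2, 11, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # optimize the metric this business case cares about
    cv=5,
    n_jobs=-1,
)
# grid.fit(X_train, y_train)                    # assumes the project data
# print(grid.best_params_, grid.best_score_)
```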
# creating an instance of the best model
model2 = best_estimator
# fitting the best model to the training data
model2.fit(X_train, y_train)
confusion_matrix_sklearn(model2, X_train, y_train)
decision_tree_tune_perf_train = model_performance_classification_sklearn(
model2, X_train, y_train
)
decision_tree_tune_perf_train
Observations
confusion_matrix_sklearn(model2, X_test, y_test)
decision_tree_tune_perf_test = model_performance_classification_sklearn(
model2, X_test, y_test
)
decision_tree_tune_perf_test
feature_names = list(X_train.columns)
importances = model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
model2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model2, feature_names=feature_names, show_weights=True))
Observation
If income is less than or equal to 92.5k and monthly CCAvg spend is greater than 2.95k, the customer is most likely to purchase a loan.
If income is greater than 92.5k and the education level is greater than 1, the customer is likely to purchase a loan.
importances = model2.feature_importances_
importances
# importance of features in the tree building
importances = model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Below we illustrate the effect of ccp_alpha on regularizing the trees and how to choose an optimal ccp_alpha value, starting with the total impurity of leaves vs the effective alphas of the pruned tree.
Minimal cost complexity pruning recursively finds the node with the "weakest
link". The weakest link is characterized by an effective alpha, where the
nodes with the smallest effective alpha are pruned first. To get an idea of
what values of ccp_alpha could be appropriate, scikit-learn provides
DecisionTreeClassifier.cost_complexity_pruning_path that returns the
effective alphas and the corresponding total leaf impurities at each step of
the pruning process. As alpha increases, more of the tree is pruned, which
increases the total impurity of its leaves.
clf = DecisionTreeClassifier(criterion='gini',random_state=42, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
pd.DataFrame(path)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=42, ccp_alpha=ccp_alpha, class_weight="balanced",criterion='gini'
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
model4 = best_model
confusion_matrix_sklearn(model4, X_train, y_train)
decision_tree_post_perf_train = model_performance_classification_sklearn(
model4, X_train, y_train
)
decision_tree_post_perf_train
confusion_matrix_sklearn(model4, X_test, y_test)
decision_tree_post_test = model_performance_classification_sklearn(
model4, X_test, y_test
)
decision_tree_post_test
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
model4,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
Observations
# Text report showing the rules of a decision tree -
print(tree.export_text(model4, feature_names=feature_names, show_weights=True))
importances = model4.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_default_perf_train.T,
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_default_perf_test.T,
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_post_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree (sklearn default)",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Conclusions
The aim was to build a model with the highest recall, in order to capture every positive instance in this business case, where missing a positive instance (a customer who will buy the loan) is costly.
The pre-pruning and post-pruning decision trees both fulfil this criterion. Comparing the other metrics (precision, accuracy, F1 score), the post-pruning decision tree should be selected because:
FINAL MODEL SELECTION
The model built will target 100% of the customers who are likely to purchase a loan.
Based on the insights derived from the models above, the bank's marketing department should look into the following key observations to decide which segments to target in order to maximize the conversion of liability customers:
For income <= 92.5k, target customers with:
- Higher average monthly CC spend (2.95k to 4.35k per month)

For income > 92.5k, factor in Education and Family size along with CCAvg:
- For the lower education level (1):
  - Smaller families (1, 2) with a higher average CC spend (> 3.3k)
  - Bigger families (3, 4)
- For higher education levels (2, 3):
  - Customers with income less than 106.5k and CCAvg <= 2.45k
  - CCAvg 2.45

For income > 114k, target all customers.